Introduction

Police shootings have been in the nightly news for as long as TV has existed, yet each new shooting brings everyone to chilling realizations. This study aims to shed some light on such dark themes and equip people with the facts.

This data is a list of fatal police shootings from 2015 onward, as recorded by the Washington Post, that details the deaths of United States citizens at the hands of police officers. It includes the name of each citizen, the manner of death, whether the citizen was armed, age, gender, city, and date.

Due to the personal nature of loss, the research team omitted the names of the deceased out of respect.

R Configuration

Below we display our sessionInfo() for replication purposes.

sessionInfo(package=NULL)
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 14393)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] backports_1.0.5 magrittr_1.5    rprojroot_1.2   tools_3.3.3    
##  [5] htmltools_0.3.5 yaml_2.1.14     Rcpp_0.12.10    stringi_1.1.5  
##  [9] rmarkdown_1.4   knitr_1.15.1    stringr_1.2.0   digest_0.6.12  
## [13] evaluate_0.10

Obtaining The Data Set

The data set was originally found on data.world, uploaded by a user named carlvlewis, who claims that the data set is updated daily. For the purpose of this study, we used the latest version of the data that was available.

The data that was originally obtained was raw and contained private information regarding real people, so the data was cleaned using the lapply and gsub functions. The team removed the names of the deceased and made the race variable clearer. This cleaned data set was uploaded to the project data set on data.world.

To obtain a copy of this data, follow these steps.

  1. Copy and paste the following link into your preferred web browser: https://data.world/robin-stewart/s-17-dv-final-project
    1. At the top-right section of your screen is a blue download button that, when clicked, will download a zip file of the data set.
    2. Alternatively, scroll down to see a sample of the csv dataset. Click on the download button located to the right of the Explore button to download a csv file of the data set.

The following is a summary of the cleaned data set:

  summary(fatalPoliceShootings)
##        id                               name              date     
##  Min.   :   3.0   TK TK                   :  21   2015-07-07:   8  
##  1st Qu.: 650.2   Brandon Jones           :   2   2015-12-14:   8  
##  Median :1203.5   Daquan Antonio Westbrook:   2   2016-01-27:   8  
##  Mean   :1204.1   Eric Harris             :   2   2016-12-21:   8  
##  3rd Qu.:1768.5   Jamake Cason Thomas     :   2   2017-01-24:   8  
##  Max.   :2333.0   Michael Johnson         :   2   2017-02-03:   8  
##                   (Other)                 :2059   (Other)   :2042  
##          manner_of_death          armed           age        gender  
##  shot            :1941   gun         :1145   Min.   : 6.00   F:  86  
##  shot and Tasered: 149   knife       : 310   1st Qu.:26.00   M:2004  
##                          unarmed     : 152   Median :34.00           
##                          vehicle     : 132   Mean   :36.52           
##                          undetermined:  98   3rd Qu.:45.00           
##                          toy weapon  :  89   Max.   :86.00           
##                          (Other)     : 164   NA's   :44              
##  race              city          state      signs_of_mental_illness
##   : 119   Los Angeles:  31   CA     : 350   False:1571             
##  A:  31   Phoenix    :  24   TX     : 191   True : 519             
##  B: 520   Houston    :  23   FL     : 125                          
##  H: 356   Chicago    :  22   AZ     :  93                          
##  N:  28   Las Vegas  :  16   CO     :  63                          
##  O:  28   Austin     :  15   OK     :  63                          
##  W:1008   (Other)    :1959   (Other):1205                          
##        threat_level           flee      body_camera 
##  attack      :1347              :  34   False:1866  
##  other       : 614   Car        : 312   True : 224  
##  undetermined: 129   Foot       : 245               
##                      Not fleeing:1421               
##                      Other      :  78               
##                                                     
## 

ETL Data Cleaning

The following is the code we used to clean the data. Specifically, we used the ETL to remove the victims’ names and to convert the values in the state column to uppercase. Then we changed specific values to be more descriptive. For example, we changed each individual’s race from one letter (e.g., ‘H’) to the full word (e.g., ‘HISPANIC’). This was done for our convenience, so that we could identify race more quickly. The same was done to the gender column, switching ‘F’ and ‘M’ to ‘FEMALE’ and ‘MALE’ respectively. Finally, we parse through the dimensions of the data frame to remove symbols that could cause errors later when using the data.

require(readr)
require(plyr)

file_path = "../01 Data/fatal-police-shootings-data.csv"
df <- read.csv(file_path, header=TRUE, stringsAsFactors=FALSE)
df$name <- NULL
names(df)

str(df)
measures <- c("id", "age")
dimensions <- setdiff(names(df), measures)
dimensions

for(n in names(df)) {
  df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]",replacement= ""))
}

df["state"] <- data.frame(lapply(df["state"], toupper))

# Expand single-letter race codes to full labels (anchored so only whole codes match).
df$race <- gsub("^W$", "WHITE", df$race)
df$race <- gsub("^H$", "HISPANIC", df$race)
df$race <- gsub("^B$", "BLACK", df$race)
df$race <- gsub("^N$", "NATIVE AMERICAN", df$race)
df$race <- gsub("^A$", "ASIAN", df$race)
df$race <- gsub("^O$", "OTHER", df$race)
df["race"]

# Expand single-letter gender codes to full labels.
df$gender <- gsub("^F$", "FEMALE", df$gender)
df$gender <- gsub("^M$", "MALE", df$gender)
df["gender"]

head(df)

na2emptyString <- function (x) {
  x[is.na(x)] <- ""
  return(x)
}
if(length(dimensions) > 0) {
  for(d in dimensions) {
    # Change NA to the empty string.
    df[d] <- data.frame(lapply(df[d], na2emptyString))
    # Get rid of " and ' in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern="[\"']",replacement= ""))
    # Change & to and in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern="&",replacement= " and "))
    # Change : to ; in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern=":",replacement= ";"))
  }
}

na2zero <- function (x) {
  x[is.na(x)] <- 0
  return(x)
}
if(length(measures) > 0) {
  for(m in measures) {
    print(m)
    df[m] <- data.frame(lapply(df[m], gsub, pattern="[^--.0-9]",replacement= ""))
    df[m] <- data.frame(lapply(df[m], na2zero))
    df[m] <- data.frame(lapply(df[m], function(x) as.numeric(as.character(x))))
  }
}
str(df)

write.csv(df, gsub("-data", "-cleaned", file_path), row.names=FALSE, na = "")
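
The chain of gsub calls used above to expand the race codes could also be written as a single lookup. The following is only a minimal sketch of that alternative, not the code the team ran; the vector names race_codes, race_labels, and recoded are hypothetical, and the sample codes are made up for illustration.

```r
# Hypothetical sample of single-letter codes, including a blank entry.
race_codes <- c("W", "H", "B", "N", "A", "O", "")

# Named vector used as a lookup table: names are codes, values are labels.
race_labels <- c(W = "WHITE", H = "HISPANIC", B = "BLACK",
                 N = "NATIVE AMERICAN", A = "ASIAN", O = "OTHER")

# Index the lookup by code; codes without an entry (such as "") come back as NA.
recoded <- unname(race_labels[race_codes])

# Keep unmatched codes unchanged instead of leaving them as NA.
recoded[is.na(recoded)] <- race_codes[is.na(recoded)]

print(recoded)
```

A lookup like this keeps every code and label in one place, which may be easier to extend than a sequence of pattern replacements.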

Census Data

The U.S. Census Bureau and data.world have recently announced a partnership, which has resulted in data.world hosting the Census Bureau’s biggest annual household survey, the American Community Survey.

Through the official US Census Bureau data.world profile, census data will be offered for any person to use and analyze: https://data.world/uscensusbureau

We used the 2011-2015 Income of US Population Estimates dataset for this particular study. This dataset was found here: https://data.world/uscensusbureau/acs-2015-5-e-income

Using the data.world R package, the 2011-2015 Income of US Population dataset and the Fatal Police Shootings dataset were queried, pulled into RStudio, and combined using dplyr.
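
To make the combining step concrete, here is a minimal sketch of a state-level join. It is an illustration under stated assumptions rather than the team's actual query: the data frames shootings and income are hypothetical toy stand-ins with made-up values, and base R merge() is used instead of dplyr so the sketch runs without extra packages.

```r
# Toy stand-ins for the two queried tables; all values are made up.
shootings <- data.frame(
  id    = c(3, 4, 5),
  state = c("WA", "OR", "KS"),
  armed = c("gun", "gun", "unarmed"),
  stringsAsFactors = FALSE
)
income <- data.frame(
  State             = c("KS", "OR", "WA"),
  GINI              = c(0.45, 0.46, 0.46),
  Per_Capita_Income = c(27000, 27500, 31000),
  stringsAsFactors = FALSE
)

# Attach the state-level income columns to every shooting in that state.
# A dplyr equivalent would be:
#   dplyr::inner_join(shootings, income, by = c("state" = "State"))
combined <- merge(shootings, income, by.x = "state", by.y = "State")

print(combined)
```

Because the income table has one row per state, every shooting in the same state receives identical income values after the join.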

Here is a summary of the combined dataset:

summary(incomeOfTheFatallyShot)
##        X              State         GINI        Per_Capita_Income
##  Min.   :  1.00   CA     :15   Min.   :0.4181   Min.   :22798    
##  1st Qu.: 25.75   TX     : 8   1st Qu.:0.4618   1st Qu.:25737    
##  Median : 50.50   OR     : 6   Median :0.4753   Median :26999    
##  Mean   : 50.50   AL     : 5   Mean   :0.4706   Mean   :27979    
##  3rd Qu.: 75.25   FL     : 5   3rd Qu.:0.4801   3rd Qu.:30318    
##  Max.   :100.00   TN     : 5   Max.   :0.5317   Max.   :47675    
##                   (Other):56                                     
##  Median_Family_Income Median_Non_Family_Income Median_Income  
##  Min.   :51782        Min.   :23027            Min.   :41371  
##  1st Qu.:57856        1st Qu.:28639            1st Qu.:47507  
##  Median :62717        Median :31848            Median :51243  
##  Mean   :64588        Mean   :33022            Mean   :53300  
##  3rd Qu.:70720        3rd Qu.:37909            3rd Qu.:61062  
##  Max.   :90089        Max.   :61466            Max.   :74551  
##                                                               
##        id               date            manner_of_death        armed   
##  Min.   :   3   2016-01-27: 8   shot            :91     gun       :55  
##  1st Qu.:1164   2016-01-16: 6   shot and Tasered: 9     knife     :16  
##  Median :1188   2016-01-17: 5                           toy weapon:11  
##  Mean   :1067   2016-01-18: 5                           unarmed   : 7  
##  3rd Qu.:1219   2016-01-31: 5                           vehicle   : 7  
##  Max.   :1288   2016-02-04: 5                           chain saw : 1  
##                 (Other)   :66                           (Other)   : 3  
##       age           gender                race             city   
##  Min.   :12.00   FEMALE: 6   ASIAN          : 2   Kansas City: 2  
##  1st Qu.:28.00   MALE  :94   BLACK          :17   Mesa       : 2  
##  Median :35.50               HISPANIC       :18   San Antonio: 2  
##  Mean   :36.91               NATIVE AMERICAN: 3   Acworth    : 1  
##  3rd Qu.:45.00               OTHER          : 1   Albuquerque: 1  
##  Max.   :64.00               WHITE          :58   Aloha      : 1  
##                              NA's           : 1   (Other)    :91  
##  signs_of_mental_illness threat_level          flee    body_camera
##  false:66                attack:57    Car        :18   false:87   
##  true :34                other :43    Foot       :16   true :13   
##                                       Not fleeing:63              
##                                       Other      : 3              
##                                                                   
##                                                                   
## 

The columns GINI, Per_Capita_Income, Median_Family_Income, and Median_Non_Family_Income were added from the census data and matched on the state of the individual shot. This dataset was saved as a CSV for later use in Tableau and R Markdown.

To create interesting visualizations, we must first understand what the combined data means. The columns that were queried from data.world were GINI, Per_Capita_Income, Median_Family_Income, and Median_Non_Family_Income. These are state-level summaries, meaning that each fatal shooting carries the per capita income, median family income, and median non-family income of the state in which the shooting took place. Although this mixes individual-level data with general data, it sheds some light on the type of environment and socioeconomic context in which the shooting took place.

Non-Aggregated Measures Analysis

Box Plot: The Median Family Income vs. Fleeing plot is shown above. It is a box plot colored by gender. Generally, the median family income for the data points falls roughly between 55K and 70K. There are few outliers outside of this range, meaning that the majority of data points are similar.

  plot(boxplot)

This plot shows the median family income grouped by how the person fled from the police. We can see that the average of the state’s median family income for people who fled on foot is higher than for the other ways of fleeing. This plot differs from the Tableau plot in that the R/Shiny version does not display individual dots. The sampling from the data is also different in Shiny, so the numbers differ: we see a higher average median family income for people fleeing on foot.
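
A box plot of this kind could be built along the following lines. This is a hedged sketch and not the code behind plot(boxplot): it assumes ggplot2 is installed, and the data frame toy, its values, and the name boxplot_sketch are hypothetical.

```r
library(ggplot2)

# Toy rows standing in for the combined data set; all values are made up.
toy <- data.frame(
  flee   = c("Car", "Car", "Foot", "Foot", "Not fleeing", "Not fleeing"),
  gender = c("MALE", "FEMALE", "MALE", "MALE", "FEMALE", "MALE"),
  Median_Family_Income = c(58000, 61000, 70000, 66000, 57000, 63000)
)

# One box per fleeing category, split and colored by gender.
boxplot_sketch <- ggplot(toy, aes(x = flee, y = Median_Family_Income, color = gender)) +
  geom_boxplot()

# Per-category medians, which is the center line drawn inside each box.
flee_medians <- tapply(toy$Median_Family_Income, toy$flee, median)
print(flee_medians)
```

Calling plot(boxplot_sketch) or print(boxplot_sketch) would render the figure.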

Aggregated Measures Analysis

Histogram: This dual-axis histogram shows per capita income along the x axis. The left axis shows the count for each per capita income bin and the right axis shows the average median income. The blue bars go with the left axis and the orange dots go with the right. There is also a general average line, in addition to a quarterly page system; the current quarter is Q1. It’s interesting to note that the per capita income count peaks at 26K and that the average median income steadily increases with per capita income.

Histogram: This is the same visualization as shown above, except that it also includes actions when selecting specific data from the histogram.

  plot(histogram)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In this histogram, we plot the counts of per capita income for people shot by the police. We can see in this graph that the majority of people are from states that have low per capita income (less than 30k). This is different from our Tableau plot in that we were unable to do a dual axis plot in ggplot to show the average median income.
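
The stat_bin() message shown with the plot suggests choosing an explicit bin width. The sketch below shows one way that could look; it assumes ggplot2 is installed, the 1,000-dollar bin width is an arbitrary illustrative choice, and the data frame toy holds made-up values rather than the project data.

```r
library(ggplot2)

# Toy per capita income values (made up) standing in for the combined data.
toy <- data.frame(
  Per_Capita_Income = c(23000, 25500, 26000, 26500, 27000, 30500, 47000)
)

# Setting binwidth explicitly avoids the default of 30 bins.
histogram_sketch <- ggplot(toy, aes(x = Per_Capita_Income)) +
  geom_histogram(binwidth = 1000)

# Share of rows from states below 30k per capita income.
share_below_30k <- mean(toy$Per_Capita_Income < 30000)
print(share_below_30k)
```

A narrower or wider binwidth changes where the peak appears, so it is worth trying a few values before reading too much into a single bar.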

Scatter Plots

Map: The map is a relatively simple example in which each state is colored by its average median income. The darker colored states have a higher income than the lighter colored states. It’s interesting to note that states closer to either coast appear to have higher income than those in the middle of the US.

Scatter Plot: This plot shows age vs. median income with average trendlines. The color is based on the threat level. It’s interesting to note that, for the trendline where the threat level was undetermined, age decreases as median income increases.

Dashboard: This is based on other plots already described. It simply shows all of the details about the trend model from the scatter plot.

Dashboard: The dashboard combines two different plots: a scatter plot example and a histogram example. These plots are also used for actions and are described further in other parts of the notebook. This page makes the action easier to see, so that switching between workbook sheets is not required.

  plot(scatterplot)

This scatterplot depicts both the median family income and the GINI index for the state in which each shooting took place. It is colored to represent what the individual was armed with when shot by the police. We can see that there is no apparent correlation between median family income and GINI index, and that individuals from states with low to average median family income tended to be armed with guns. In Shiny, this graph is equipped with actions to zoom in on a set of points or individual points.
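
A visual impression of "no correlation" can be checked numerically. The following is a minimal sketch using base R; the data frame toy and its values are made up for illustration, so the number it prints says nothing about the real data.

```r
# Toy state-level values (made up) standing in for the combined data.
toy <- data.frame(
  Median_Family_Income = c(52000, 58000, 63000, 71000, 90000),
  GINI  = c(0.47, 0.45, 0.48, 0.46, 0.47),
  armed = c("gun", "knife", "gun", "unarmed", "gun")
)

# Pearson correlation between the two plotted variables;
# values near 0 indicate little linear relationship.
income_gini_cor <- cor(toy$Median_Family_Income, toy$GINI)
print(income_gini_cor)

# Median of the state-level family income for each weapon category.
income_by_armed <- tapply(toy$Median_Family_Income, toy$armed, median)
print(income_by_armed)
```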

Cross Tabs

KPI: This plot graphs people who were shot by the police by how they were fleeing and by their mental state, and shows their median family income. Of particular note is that mentally ill people who were fleeing on foot had a higher median family income.

  plot(fleePlot)

This is a bar chart depicting median family income by how people were fleeing, separated by signs of mental illness. The line shows the average of median incomes by mental illness and fleeing type.

Set: The Gender vs. Race crosstab plot is shown above. This graph shows the race of the individual shot against the individual’s gender. Each cell has the median income colored by race. A set was created by filtering those individuals who were shot in a state whose median family income was between 46,000 and 62,000. Where appropriate, the set value appears as the smaller text underneath the larger text in a cell. It is interesting to see that, in general, the average median income for females is higher.

  plot(genderRacePlot)

This graph shows the race of the individual shot against the individual’s gender. A set was created by using the dplyr filter function to separate those individuals who were shot in a state whose median family income was between 46,000 and 62,000.
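
The set described above amounts to a range filter. Here is a minimal sketch of that idea in base R; the data frame toy is made up, and treating both bounds as inclusive is an assumption, since the text only says "between".

```r
# Toy rows (made up) standing in for the combined data.
toy <- data.frame(
  gender = c("MALE", "FEMALE", "MALE", "MALE"),
  race   = c("WHITE", "BLACK", "HISPANIC", "WHITE"),
  Median_Family_Income = c(45000, 46000, 62000, 70000),
  stringsAsFactors = FALSE
)

# Flag set membership; both bounds are assumed to be inclusive.
# A dplyr equivalent would be:
#   dplyr::filter(toy, dplyr::between(Median_Family_Income, 46000, 62000))
in_set <- toy$Median_Family_Income >= 46000 & toy$Median_Family_Income <= 62000
income_set <- toy[in_set, ]

# Cross-tabulate gender against race for the rows inside the set.
set_counts <- table(income_set$gender, income_set$race)
print(set_counts)
```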

Parameter: This is a graph of the gender of the individual shot vs. whether that individual showed signs of mental illness. The graph was then colored based on whether that individual was shot in an area with high, medium, or low per capita income. This is a good example of using a parameter.

  plot(raceFleePlot)

This R visualization was created using the calculated fields of Median(MedianFamilyIncome/PerCapitaIncome) and plotting based on how the individual fled against the race of the individual shot.
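
The calculated field named above can be reproduced as a ratio followed by a grouped median. This is a sketch under assumptions: the data frame toy, its values, and the column name income_ratio are hypothetical, and base R aggregate() stands in for whatever grouping code the team used.

```r
# Toy rows (made up) standing in for the combined data.
toy <- data.frame(
  flee = c("Car", "Car", "Foot", "Foot"),
  race = c("WHITE", "WHITE", "BLACK", "BLACK"),
  Median_Family_Income = c(60000, 66000, 70000, 54000),
  Per_Capita_Income    = c(30000, 30000, 35000, 27000),
  stringsAsFactors = FALSE
)

# Row-level ratio of median family income to per capita income.
toy$income_ratio <- toy$Median_Family_Income / toy$Per_Capita_Income

# Median of that ratio for each combination of fleeing type and race.
ratio_by_group <- aggregate(income_ratio ~ flee + race, data = toy, FUN = median)
print(ratio_by_group)
```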

Bar Charts

Table Calculations: The Median Income of Race Broken up by Gender plot is shown above. This graph is a bar chart with a table calculation. It combines our data set and the census data set through the variables of race and gender from the former and Median Income from the latter. Multiple rows are used to break down each race by gender. A table calculation is used, specifically avg_median_income - window_avg_income. This value drives the color and the text, and helps show that males have relatively high values for it even though they may have a relatively low avg_median_income on its own compared to females.

  plot(incomeByRacePlot)

The Median Income of Race Broken up by Gender plot is shown above, specifically the R version. It combines our data set and the census data set through the variables of race and gender from the former and Median Income from the latter. A facet is used to break down each race by gender. A table calculation is used, specifically avg_median_income - window_avg_income. The text for this value helps show that males have relatively high values for it even though they may have a relatively low avg_median_income on its own compared to females. The numbers are slightly different due to limits on the SQL statement, which keep the process from taking too long to pull the data.
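
The table calculation avg_median_income - window_avg_income can be sketched in two steps: average within each race and gender cell, then subtract a window average. The example below is illustrative only; the data frame toy holds made-up values, and the window is assumed to span all cells, which may differ from the partitioning used in the original Tableau and R versions.

```r
# Toy rows (made up) standing in for the combined data.
toy <- data.frame(
  race   = c("BLACK", "BLACK", "BLACK", "WHITE", "WHITE"),
  gender = c("FEMALE", "FEMALE", "MALE", "FEMALE", "MALE"),
  Median_Income = c(52000, 54000, 50000, 56000, 54000),
  stringsAsFactors = FALSE
)

# Step 1: average median income within each race/gender cell.
cells <- aggregate(Median_Income ~ race + gender, data = toy, FUN = mean)
names(cells)[names(cells) == "Median_Income"] <- "avg_median_income"

# Step 2: window average, assumed here to be the mean over all cells.
window_avg_income <- mean(cells$avg_median_income)

# Difference of each cell from the window average.
cells$difference <- cells$avg_median_income - window_avg_income
print(cells)
```

Under this assumption the differences sum to zero, so a positive value simply means the cell sits above the overall average.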

ID Sets: Individuals were plotted with their median income. A set was created for high-income individuals, those with over 60k in income. These people were then plotted with their state’s GINI score, which is shown here. All of these high-income individuals had roughly the same GINI score.

  plot(inequalityPlot)

This bar chart shows the GINI inequality index of the area each individual is from, using ID sets to separate high-income individuals.

R Shiny

All R visualizations were published in R Shiny, with each graph on a different tab. Here is a link to the published Shiny application: https://robin-stewart.shinyapps.io/final_project/

Online Shiny Application